Before introducing the topic, we will clear the workspace completely.

rm(list=ls())

Introduction and context of the topic

Imagine a big data company is looking for a new data scientist; once they have short-listed the most promising profiles, they want to identify which candidates will be most profitable for them in the long run.
Logically, they want to reduce the cost of training courses by hiring a person who is more likely to stay with the company over time.

HR Analytics: Job Change of Data Scientists is a dataset that uses credentials, demographics, education, experience and other social factors to predict which workers are more likely to stay in the company or leave for another position.
Moreover, this data lets us learn other facts, such as which aspects matter most for an employee to keep their data scientist position or to drop out. The database is ready to download in this link

Preparing R and data: format, NAs and feature engineering

We are going to load all libraries and set the working directory.

library(xfun)
library(rmarkdown)
library(factoextra)
library(dplyr)
library(ggplot2)
library(cluster)
library(dendextend)
library(psych)
library(tidyverse)
library(tidyr)
library(kernlab)
library(mclust)
library(NbClust)
library(GGally)
library(Amelia)
library(corrplot)

We will treat train_data (14 columns) as our full dataset, because merging it with test_data (13 columns) would leave one important column (target) missing for part of the data.
For our purpose, the information provided by this variable is more valuable than the few, non-characteristic observations in test_data.

setwd("C:/Users/lucia/OneDrive/Documentos/Universidad/Segundo curso/Aprendizaje")

data = read.csv("aug_train.csv",header=T, na.strings="") 
head(data)
##   enrollee_id     city city_development_index gender     relevent_experience
## 1        8949 city_103                  0.920   Male Has relevent experience
## 2       29725  city_40                  0.776   Male  No relevent experience
## 3       11561  city_21                  0.624   <NA>  No relevent experience
## 4       33241 city_115                  0.789   <NA>  No relevent experience
## 5         666 city_162                  0.767   Male Has relevent experience
## 6       21651 city_176                  0.764   <NA> Has relevent experience
##   enrolled_university education_level major_discipline experience company_size
## 1       no_enrollment        Graduate             STEM        >20         <NA>
## 2       no_enrollment        Graduate             STEM         15        50-99
## 3    Full time course        Graduate             STEM          5         <NA>
## 4                <NA>        Graduate  Business Degree         <1         <NA>
## 5       no_enrollment         Masters             STEM        >20        50-99
## 6    Part time course        Graduate             STEM         11         <NA>
##     company_type last_new_job training_hours target
## 1           <NA>            1             36      1
## 2        Pvt Ltd           >4             47      0
## 3           <NA>        never             83      0
## 4        Pvt Ltd        never             52      1
## 5 Funded Startup            4              8      0
## 6           <NA>            1             24      1

Database features

First, let's inspect the structure of the data, review the column formats and deal with the NAs.

str(data)
## 'data.frame':    19158 obs. of  14 variables:
##  $ enrollee_id           : int  8949 29725 11561 33241 666 21651 28806 402 27107 699 ...
##  $ city                  : chr  "city_103" "city_40" "city_21" "city_115" ...
##  $ city_development_index: num  0.92 0.776 0.624 0.789 0.767 0.764 0.92 0.762 0.92 0.92 ...
##  $ gender                : chr  "Male" "Male" NA NA ...
##  $ relevent_experience   : chr  "Has relevent experience" "No relevent experience" "No relevent experience" "No relevent experience" ...
##  $ enrolled_university   : chr  "no_enrollment" "no_enrollment" "Full time course" NA ...
##  $ education_level       : chr  "Graduate" "Graduate" "Graduate" "Graduate" ...
##  $ major_discipline      : chr  "STEM" "STEM" "STEM" "Business Degree" ...
##  $ experience            : chr  ">20" "15" "5" "<1" ...
##  $ company_size          : chr  NA "50-99" NA NA ...
##  $ company_type          : chr  NA "Pvt Ltd" NA "Pvt Ltd" ...
##  $ last_new_job          : chr  "1" ">4" "never" "never" ...
##  $ training_hours        : int  36 47 83 52 8 24 24 18 46 123 ...
##  $ target                : num  1 0 0 1 0 1 0 1 1 0 ...

Many variables are of character type, which would not allow us to work with them later. For this reason, we will change them to factor, and later to numeric (unless the value is categorical rather than numeric).

data$city = as.factor(data$city)
data$gender = as.factor(data$gender)
data$relevent_experience = as.factor(data$relevent_experience)
data$enrolled_university = as.factor(data$enrolled_university)
data$education_level = as.factor(data$education_level)
data$major_discipline = as.factor(data$major_discipline)
data$experience = as.factor(data$experience)
data$company_size = as.factor(data$company_size)
data$last_new_job = as.factor(data$last_new_job)
data$training_hours = as.integer(data$training_hours)
data$target = as.factor(data$target)
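The repetitive conversions above could be condensed with dplyr's across(); note that where(is.character) would also convert company_type, which the script above leaves as character. A sketch on toy data (the toy data frame is mine, not from the dataset):

```r
library(dplyr)

# Toy frame mimicking the column types of the real data
df = data.frame(gender = c("Male", "Female"), experience = c(">20", "5"),
                training_hours = c(36L, 47L), stringsAsFactors = FALSE)

# Convert every character column to factor in one call
df2 = mutate(df, across(where(is.character), as.factor))
stopifnot(is.factor(df2$gender), is.factor(df2$experience),
          is.integer(df2$training_hours))
```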


str(data)
## 'data.frame':    19158 obs. of  14 variables:
##  $ enrollee_id           : int  8949 29725 11561 33241 666 21651 28806 402 27107 699 ...
##  $ city                  : Factor w/ 123 levels "city_1","city_10",..: 6 78 65 15 51 58 50 84 6 6 ...
##  $ city_development_index: num  0.92 0.776 0.624 0.789 0.767 0.764 0.92 0.762 0.92 0.92 ...
##  $ gender                : Factor w/ 3 levels "Female","Male",..: 2 2 NA NA 2 NA 2 2 2 NA ...
##  $ relevent_experience   : Factor w/ 2 levels "Has relevent experience",..: 1 2 2 2 1 1 1 1 1 1 ...
##  $ enrolled_university   : Factor w/ 3 levels "Full time course",..: 2 2 1 NA 2 3 2 2 2 2 ...
##  $ education_level       : Factor w/ 5 levels "Graduate","High School",..: 1 1 1 1 3 1 2 1 1 1 ...
##  $ major_discipline      : Factor w/ 6 levels "Arts","Business Degree",..: 6 6 6 2 6 6 NA 6 6 6 ...
##  $ experience            : Factor w/ 22 levels "<1",">20","1",..: 2 9 18 1 2 5 18 7 20 11 ...
##  $ company_size          : Factor w/ 8 levels "<10","10/49",..: NA 6 NA NA 6 NA 6 1 6 5 ...
##  $ company_type          : chr  NA "Pvt Ltd" NA "Pvt Ltd" ...
##  $ last_new_job          : Factor w/ 6 levels ">4","1","2","3",..: 2 1 6 6 5 2 2 1 2 1 ...
##  $ training_hours        : int  36 47 83 52 8 24 24 18 46 123 ...
##  $ target                : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 1 2 2 1 ...
sum(is.na(data))
## [1] 20733

In company_size some rows contain the value 10/49, which does not match the format of the other company sizes, so I will start cleaning this column by changing the "/" sign to "-" (10/49 becomes 10-49).
Later we check that no "10/49" entries remain.

data$company_size = gsub("/","-",data$company_size)
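Note that gsub() returns a character vector, so company_size stops being a factor at this point (it is re-encoded as a factor later, before the PCA). A self-contained illustration on a toy vector:

```r
# gsub() coerces its input to character, so the factor class is lost
sizes = factor(c("<10", "10/49", "50-99"))
cleaned = gsub("/", "-", sizes)
stopifnot(is.character(cleaned))
stopifnot(identical(cleaned, c("<10", "10-49", "50-99")))
stopifnot(!any(grepl("/", cleaned)))   # no "/" labels remain
```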

On the other hand, there are about 20,000 NAs. This amount of missing values would hurt our models, so let's see exactly where they are located (numerically and graphically) and fix them, starting with the mice function.

null_df = data.frame(variable = names(data),
total_null = colSums(is.na(data)),
row.names = NULL)

null_df
##                  variable total_null
## 1             enrollee_id          0
## 2                    city          0
## 3  city_development_index          0
## 4                  gender       4508
## 5     relevent_experience          0
## 6     enrolled_university        386
## 7         education_level        460
## 8        major_discipline       2813
## 9              experience         65
## 10           company_size       5938
## 11           company_type       6140
## 12           last_new_job        423
## 13         training_hours          0
## 14                 target          0

Data visualization: NAs by column

null_df$variable = factor(null_df$variable,
levels = null_df$variable[order(null_df$total_null,decreasing = TRUE)])

ggplot(null_df, aes(x = variable, y = total_null))+
geom_bar(stat = "identity", col = "#FFD972", fill = "#09BC8A")+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))+
geom_label(aes(label = total_null, size = NULL), nudge_y = 0.7)+
theme(plot.title = element_text(hjust = 0.5))+
ggtitle("Total NAs by col")+
xlab("Variable")+
ylab("Total NAs")
missing = is.na(data)
sum(missing)
## [1] 20733
missmap(data, main = 'Missing Map', col = c('#58A4B0', '#373F51'))

To fix the null values we have many tools: removing the rows outright, imputing with the mean, median or mode, predicting them… First, I tried the mice function.
mice creates imputations (replacements for the missing values) for multivariate missing data.

#library(mice)
#imp = mice(data, m = 5,
#           method = c("","","","pmm","pmm","pmm","pmm","pmm","pmm","pmm","pmm","pmm","",""),
#           maxit = 20)

#final_data = complete(imp, 5)
#summary(final_data)

The main issue is that mice is very time- and power-consuming, so I discarded this option and instead computed the most frequent level of each variable, assigning it to that variable's NAs.
For example, men are the predominant gender, so we may assume that a row with a missing gender is more likely to belong to a man. In the Exploratory Data Analysis part (right after this one) we will explore these relations of predominance.
For now, we will use those conclusions and, in the next point, show them graphically:
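The "most frequent level" values hard-coded below can also be computed programmatically; a minimal sketch with a hypothetical helper (mode_of is my name, not part of the original script):

```r
# Most frequent non-NA value of a vector (table() drops NAs by default)
mode_of = function(x) names(which.max(table(x)))

mode_of(c("Male", "Male", "Female", NA))  # "Male"
```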

data$city_development_index[which(is.na(data$city_development_index))] = mean(data$city_development_index, na.rm = TRUE)
data$gender[which(is.na(data$gender))] = "Male"
data$company_size[which(is.na(data$company_size))] = "50-99"
data$relevent_experience[which(is.na(data$relevent_experience))] = "No relevent experience"
data$enrolled_university[which(is.na(data$enrolled_university))] = "no_enrollment"
data$education_level[which(is.na(data$education_level))] = "Graduate"
data$major_discipline[which(is.na(data$major_discipline))] = "STEM"
data$experience[which(is.na(data$experience))] = ">20"
data$last_new_job[which(is.na(data$last_new_job))] = 1
data$training_hours[which(is.na(data$training_hours))] = mean(data$training_hours, na.rm = TRUE)

sum(data$target == 0)
## [1] 14381
sum(data$target == 1)
## [1] 4777
data$target[which(is.na(data$target))] = 0   # a no-op here: 14381 + 4777 = 19158, so target has no NAs

Now, let's consider the importance of the variables: I will remove city (without coordinates, the city labels are not useful), company_type and enrollee_id, now that we know there are no repeated values for that variable.
We will work with the 11 remaining variables and check that there are no NAs left.

data = select(data,-enrollee_id, -city,-company_type)
sum(is.na(data))
## [1] 0
missing = is.na(data)
sum(missing)
## [1] 0
missmap(data, main = 'Missing Map', col = c('#58A4B0', '#373F51'))

Since many profiles may share the same characteristics, the ID number is what is unique to each of them; having checked it, we conclude there are no duplicates.

Exploratory Data Analysis

By doing some EDA (note that the statistics do not take the NAs into account), we discover that:

* About 90% of the people are men.
* Over 70% of all candidates have relevant experience, with a trend where relevant experience is more common than non-relevant experience among candidates who attended college. Overall, among those who did not go to college, targets show more non-relevant than relevant experience.
* 73% of candidates were not enrolled in university at the time the data were collected; the other 27% are split between full-time and part-time courses.
* Roughly 86% of candidates went to college; 60% of them have at most a Graduate degree, 22.7% hold a Master's degree, and 2.7% have a PhD.
* Most of these studies are STEM-related (75%); humanities account for 3.49% and business for 1.7%.
* Unlike other variables, company size is fairly evenly distributed, with a slight tendency for candidates to work in small companies.
* The intervals of 20+ years and 2 to 7 years of experience are very common. Candidates with 20+ years of experience are the most likely to have relevant experience.
* Almost 1 out of 2 candidates have one year of difference between their current and previous jobs, while over 17% have 4+ years of difference.
* Most candidates come from cities with a development index of 0.9 (highly developed). The second most common interval is 0.6 ~ 0.7. A city development index of 0.9+ dominates among target 0's, and 0.6 among target 1's (those looking for a new job).
* In our graph, 20 hours of training is the most common value, and we do not see a big difference in training hours between relevant and non-relevant experience.
* Observation: people with 20+ years or 3 ~ 6 years of experience are very interested in changing jobs. The main difference between targets 0 and 1 is the size of the 20+ group: it is the majority for target 0, while for target 1 the 20+ and 3 ~ 6 groups are almost equal.

ggplot(data, aes(x=gender))+
  geom_bar(fill = "#D66BA0")+
  geom_text(stat='count', aes(label=..count..), vjust=-1)+
  ggtitle("Gender Distribution")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Gender")

ggplot(data,aes(x=company_size))+
  geom_bar(fill = "#D66BA0")+
  geom_text(stat='count', aes(label=..count..), vjust=-1)+
  theme(plot.title = element_text(hjust = 0.5))+
  ggtitle("Company size")+
  xlab("Size")+
  ylab("Count")

ggplot(data,aes(x=experience))+
  geom_bar(fill = "#D66BA0")+
  geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
  theme(plot.title = element_text(hjust = 0.5))+
  ggtitle("Experience")+
  xlab("Experience")+
  ylab("Count")

ggplot(data,aes(x= relevent_experience))+
  geom_bar(fill = "#D66BA0")+
  geom_text(stat='count', aes(label=..count..))+
  theme(plot.title = element_text(hjust = 0.5))+
  ggtitle("Relevant Experience")+
  xlab("Relevant experience")+
  ylab("Count")
ggplot(data,aes(x=relevent_experience))+
  geom_bar(fill = "#D66BA0")+
  facet_wrap(~education_level)+
  ggtitle("Relevent experience by education level")+
  theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Experience")+
  ylab("Count")

ggplot(data,aes(x=enrolled_university,fill = relevent_experience))+
  geom_bar()+
  facet_wrap(~relevent_experience)+
  theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_text(stat='count', aes(label=..count..), vjust=-1)+
  ggtitle("Enrolled University")+
  xlab("Enrolled University")+
  ylab("Count")

ggplot(data,aes(x=major_discipline,fill=relevent_experience))+
  geom_bar()+
  facet_wrap(~gender)+
  theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
  ggtitle("College major")+
  xlab("Major")+
  ylab("Count")

ggplot(data,aes(x=experience))+
  geom_bar(fill = "#D66BA0")+
  facet_wrap(~relevent_experience)+
  theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
  geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
  theme(plot.title = element_text(hjust = 0.5))+
  ggtitle("Count by relevent experience")+
  xlab("Experience")+
  ylab("Count")

ggplot(data,aes(x=city_development_index,fill = relevent_experience))+
  geom_bar()+
  theme(plot.title = element_text(hjust = 0.5))+
  ggtitle("City development index")+
  xlab("City development")+
  ylab("Count")

ggplot(data,aes(x= training_hours,fill = relevent_experience))+
geom_density()

ggplot(data,aes(x= training_hours,fill = relevent_experience))+
  geom_density()+
  facet_wrap(~relevent_experience)+
  theme(plot.title = element_text(hjust = 0.5))+
  ggtitle("Training hours by relevent experience")+
  xlim(0,60)

ggplot(data,aes(x=education_level,fill = education_level))+
  geom_bar()+
  facet_wrap(~target)+
  geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
  ggtitle("Target by education level")+
  xlab("Target")+
  ylab("Count")

ggplot(data,aes(x=major_discipline,fill = major_discipline))+
  geom_bar()+
  facet_wrap(~target)+
  theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
  geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
  theme(plot.title = element_text(hjust = 0.5))+
  ggtitle("Target by Major discipline")+
  xlab("Target")+
  ylab("Count")

ggplot(data,aes(x=city_development_index,fill = relevent_experience))+
  geom_density(alpha = 0.5)+
  facet_wrap(~target)+
  theme(plot.title = element_text(hjust = 0.5))+
  ggtitle("Target by city development index")+
  xlab("Target")+
  ylab("Count")

ggplot(data,aes(x=enrolled_university,fill = enrolled_university))+
  geom_bar()+
  facet_wrap(~target)+
  theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
  geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
  theme(plot.title = element_text(hjust = 0.5))+
  ggtitle("Target by enrolled university")+
  xlab("Target")+
  ylab("Count")

Regarding outliers, we will take the numerical variable training_hours and see whether it is balanced or whether there are many outliers that would affect our computations:

ggplot(data)+
  geom_boxplot(mapping = aes(x = gender, y = training_hours, fill = gender))+
  coord_flip()+
  ggtitle("Hours of training by gender")

The conclusion we can extract from this boxplot is that there is a lot of variability in our data: so many cases fall beyond the whiskers that flagging individual points as clear outliers is not meaningful, leaving us with unbalanced data. The SMOTE method could balance the data by oversampling, but it is not available for all versions of R.

Principal Component Analysis

# city was already removed above, so it needs no conversion here
data$relevent_experience = as.numeric(data$relevent_experience)
data$gender = as.numeric(data$gender)
data$experience = as.numeric(data$experience)
data$last_new_job = as.numeric(data$last_new_job)
data$training_hours = as.numeric(data$training_hours)
data$target = as.numeric(data$target)
data$education_level = as.numeric(data$education_level)
data$major_discipline = as.numeric(data$major_discipline)
data$enrolled_university = as.numeric(data$enrolled_university)
# company_size became character after gsub(), so re-encode as factor first
data$company_size = as.factor(data$company_size)
data$company_size = as.numeric(data$company_size)

corrplot(cor(data), is.corr = F,
number.cex = 0.4, tl.cex = 0.4, cl.cex = 0.4)

We see that the pairs showing the strongest correlations are target with city_development_index, and last_new_job with relevent_experience.

pca = prcomp(data, scale = T, center = T)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.3516 1.1289 1.04181 1.01213 0.99899 0.98635 0.94746
## Proportion of Variance 0.1661 0.1159 0.09867 0.09313 0.09073 0.08844 0.08161
## Cumulative Proportion  0.1661 0.2819 0.38060 0.47373 0.56445 0.65290 0.73450
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.93341 0.89518 0.81149 0.76768
## Proportion of Variance 0.07921 0.07285 0.05987 0.05358
## Cumulative Proportion  0.81371 0.88656 0.94642 1.00000

I have created a table with the importance of each component in the total representation; now let's check that the new variables (the PC scores) are uncorrelated and see the percentage of variance explained by each PC. The percentage explained by a component is that PC's variance divided by the sum of the variances of all the PCs.

corrplot(cor(pca$x), is.corr = F,
number.cex = 0.4, tl.cex = 0.4, cl.cex = 0.4)

fviz_screeplot(pca, addlabels = TRUE)

variances = data.frame(Dimensions = 1:11,
Percentage = pca$sdev^2/sum(pca$sdev^2)*100)
ggplot(variances)+aes(x = Dimensions, y = Percentage)+
geom_bar(stat = "identity", fill = "#5B84B1FF")+
geom_point()+geom_line()+ theme_minimal() +
scale_x_continuous(breaks = 1:11) +
labs(title = "Scree plot") + theme(text = element_text(size = 6)) +
ylab("Percentage of explained variance")

We can visualize the quality of the representation of the variables with the cos2, for which you have to calculate the correlation between the original dataset and the projected one.
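That squared-correlation definition of cos2 can be checked by hand on toy data (a sketch; factoextra computes this internally, and the toy frame below is mine):

```r
set.seed(1)
toy = data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
p = prcomp(toy, center = TRUE, scale. = TRUE)

# cos2: squared correlation of each original variable with each PC
cos2 = cor(toy, p$x)^2

# Keeping all PCs, each variable is fully represented: rows sum to 1
stopifnot(all(abs(rowSums(cos2) - 1) < 1e-8))
```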

fviz_pca_var(pca, col.var="cos2",repel = TRUE, labelsize = 2,
gradient.cols = c("blue", "orange"))+
theme(text = element_text(size = 6))

The main conclusion we can extract from our PCA is that there are not many strong relations, and that we can explain about half of the model by taking the first four components. What would have happened if we had written the scaling argument explicitly as scale.?

pca2 = prcomp(data, center = T, scale. = T); summary(pca2)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.3516 1.1289 1.04181 1.01213 0.99899 0.98635 0.94746
## Proportion of Variance 0.1661 0.1159 0.09867 0.09313 0.09073 0.08844 0.08161
## Cumulative Proportion  0.1661 0.2819 0.38060 0.47373 0.56445 0.65290 0.73450
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.93341 0.89518 0.81149 0.76768
## Proportion of Variance 0.07921 0.07285 0.05987 0.05358
## Cumulative Proportion  0.81371 0.88656 0.94642 1.00000
fviz_screeplot(pca2, addlabels = TRUE)

#comparison of both pca and pca2
barplot(pca$rotation[,1], las=2, col="#E16036")

barplot(pca2$rotation[,1], las=2, col="#2E294E")

As we see, the results are identical in both cases: this is expected, because R's partial argument matching maps scale = T in the first call onto prcomp's scale. argument, so both models were scaled. The same variables have importance and represent the same amount of information.
In some databases, transforming factors to numeric values can make scaling problematic. On the other hand, whenever the variables are measured in different units, it will be helpful to scale them so they fit the correlation circle better.
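The equivalence of the two calls comes from R's partial argument matching: scale = T matches prcomp's scale. formal argument. A quick check on toy data:

```r
set.seed(2)
m = matrix(rnorm(40), 10, 4)

p1 = prcomp(m, scale = TRUE)    # "scale" partially matches "scale."
p2 = prcomp(m, scale. = TRUE)

# Both calls produce the same (scaled) decomposition
stopifnot(isTRUE(all.equal(p1$sdev, p2$sdev)))
```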

Secondly, let's create another dataset by discarding the less important variables (based on the graphs), and see whether we reduce noise:

data3 = select(data, -company_size, -training_hours, -education_level)
pca3 = prcomp(data3, center = T, scale. = F); summary(pca3)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     7.1628 1.57112 0.94865 0.52672 0.42187 0.38559 0.26679
## Proportion of Variance 0.9267 0.04459 0.01626 0.00501 0.00321 0.00269 0.00129
## Cumulative Proportion  0.9267 0.97132 0.98757 0.99259 0.99580 0.99849 0.99977
##                            PC8
## Standard deviation     0.11254
## Proportion of Variance 0.00023
## Cumulative Proportion  1.00000
corrplot(cor(data3), is.corr = F,
number.cex = 0.4, tl.cex = 0.4, cl.cex = 0.4)

fviz_screeplot(pca3, addlabels = TRUE)

In this case, the results are totally different: with the first three components alone we retain 98.8% of the information.

barplot(pca3$rotation[,1], las=2, col="#0FFF95")

Here it is clear that relevent_experience is the most important variable for the model, followed by last_new_job and enrolled_university. The relations found in our first two models and in this third one (in which we tried to reduce noise) are broadly preserved.

pca4 = prcomp(data3, center = T, scale. = T); summary(pca4)

fviz_screeplot(pca4, addlabels = TRUE)

This PCA is more sensitive to scaling, and we see that this time the weights of the variables are more evenly distributed across the first components, which is interesting. This way, we eliminate that balance issue of the database.

barplot(pca4$rotation[,1], las=2, col="#0FFF95")

Factor analysis

Factor analysis is a statistical data-reduction technique used to explain the correlations between observed variables in terms of a smaller number of unobserved variables called factors. When comparing factor analysis with other techniques, it is important to know that its linearity and normality assumptions are stricter. Our first concern is how many factors to use; since I have been tracking how much information the model represents in the first three components, I will use 3 by default.
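Before fitting the real data, a self-contained illustration of fitting a fixed number of factors with stats::factanal (simulated toy data with a known two-factor structure, not the HR dataset):

```r
set.seed(3)
scores = matrix(rnorm(200), 100, 2)          # two latent factors
L = rbind(c(0.9, 0.8, 0.7, 0,   0,   0),    # block loading pattern
          c(0,   0,   0,   0.9, 0.8, 0.7))
x = scores %*% L + matrix(rnorm(600, sd = 0.4), 100, 6)

fit = factanal(x, factors = 2, rotation = "varimax")
stopifnot(all(dim(fit$loadings) == c(6, 2)))  # 6 variables, 2 factors
```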

fanal1 = factanal(data, 3, rotation = "varimax", scores = "regression")
fanal1
## 
## Call:
## factanal(x = data, factors = 3, scores = "regression", rotation = "varimax")
## 
## Uniquenesses:
## city_development_index                 gender    relevent_experience 
##                  0.685                  0.998                  0.390 
##    enrolled_university        education_level       major_discipline 
##                  0.834                  0.973                  0.948 
##             experience           company_size           last_new_job 
##                  0.860                  0.931                  0.832 
##         training_hours                 target 
##                  0.999                  0.005 
## 
## Loadings:
##                        Factor1 Factor2 Factor3
## city_development_index -0.283          -0.480 
## gender                                        
## relevent_experience             0.780         
## enrolled_university            -0.362  -0.177 
## education_level                        -0.118 
## major_discipline                        0.228 
## experience                      0.134   0.349 
## company_size                    0.247         
## last_new_job                    0.360   0.195 
## training_hours                                
## target                  0.984   0.116   0.115 
## 
##                Factor1 Factor2 Factor3
## SS loadings      1.068   0.973   0.505
## Proportion Var   0.097   0.088   0.046
## Cumulative Var   0.097   0.186   0.231
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 381.76 on 25 degrees of freedom.
## The p-value is 1.67e-65
fanal2 = factanal(data, 3, rotation = "none", scores = "regression")
fanal2
## 
## Call:
## factanal(x = data, factors = 3, scores = "regression", rotation = "none")
## 
## Uniquenesses:
## city_development_index                 gender    relevent_experience 
##                  0.685                  0.998                  0.390 
##    enrolled_university        education_level       major_discipline 
##                  0.834                  0.973                  0.948 
##             experience           company_size           last_new_job 
##                  0.860                  0.931                  0.832 
##         training_hours                 target 
##                  0.999                  0.005 
## 
## Loadings:
##                        Factor1 Factor2 Factor3
## city_development_index -0.343           0.437 
## gender                                        
## relevent_experience     0.130   0.762   0.111 
## enrolled_university    -0.122  -0.368   0.124 
## education_level                         0.118 
## major_discipline                       -0.226 
## experience                      0.168  -0.327 
## company_size            0.102   0.229         
## last_new_job                    0.379  -0.151 
## training_hours                                
## target                  0.997                 
## 
##                Factor1 Factor2 Factor3
## SS loadings      1.170   0.954   0.422
## Proportion Var   0.106   0.087   0.038
## Cumulative Var   0.106   0.193   0.231
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 381.76 on 25 degrees of freedom.
## The p-value is 1.67e-65
cbind(fanal1$loadings, fanal1$uniquenesses)
##                            Factor1      Factor2     Factor3          
## city_development_index -0.28326449 -0.066857440 -0.47974019 0.6851434
## gender                 -0.01303137  0.006317856  0.04658683 0.9976168
## relevent_experience     0.03988904  0.780056966 -0.01178520 0.3897815
## enrolled_university    -0.05931015 -0.362004784 -0.17688471 0.8341446
## education_level        -0.08096397  0.080505905 -0.11821733 0.9729891
## major_discipline       -0.01365783  0.004182644  0.22768405 0.9479514
## experience              0.01377505  0.133748076  0.34899536 0.8601227
## company_size            0.07890051  0.247431743 -0.04508652 0.9305236
## last_new_job           -0.02074055  0.360125061  0.19511291 0.8318011
## training_hours         -0.02201464 -0.014295825  0.01519208 0.9990832
## target                  0.98402823  0.116063458  0.11496846 0.0050000
cbind(fanal2$loadings, fanal2$uniquenesses)
##                             Factor1      Factor2      Factor3          
## city_development_index -0.343198908 -0.080846443  0.436506244 0.6851434
## gender                 -0.006668926  0.012987558 -0.046550120 0.9976168
## relevent_experience     0.130161918  0.762190626  0.111095547 0.3897815
## enrolled_university    -0.121909998 -0.368158317  0.124301108 0.8341446
## education_level        -0.084114978  0.077107642  0.118282651 0.9729891
## major_discipline        0.013587355  0.030616715 -0.225659161 0.9479514
## experience              0.070105669  0.167865929 -0.326775589 0.8601227
## company_size            0.101791249  0.228602392  0.082825361 0.9305236
## last_new_job            0.044875196  0.378766910 -0.150702098 0.8318011
## training_hours         -0.021626402 -0.009556137 -0.018994553 0.9990832
## target                  0.997494132 -0.002014260  0.001193804 0.0050000

Observation: Factor 1 is clearly related to target (about 99% of its variance explained, uniqueness 0.005), Factor 2 to relevent_experience, and Factor 3 is not strongly tied to any variable, though it correlates most with city_development_index.

barplot(fanal1$loadings[,1], las=2, col="#85BAA1", ylim = c(-1, 1))

barplot(fanal1$loadings[,2], las=2, col="#F4E04D", ylim = c(-1, 1))

barplot(fanal1$loadings[,3], las=2, col="#D4CBE5", ylim = c(-1, 1))

barplot(fanal2$loadings[,1], las=2, col="#AD5D4E", ylim = c(-1, 1))

barplot(fanal2$loadings[,2], las=2, col="#40476D", ylim = c(-1, 1))

barplot(fanal2$loadings[,3], las=2, col="#826754", ylim = c(-1, 1))

Insights: fanal1 gives results similar to PCA, but in general the weights are less concentrated and more spread across all the variables. What happens if we try other factoring methods?

#fm = ml -> maximum likelihood
#fm = pa -> principal axis factor analysis
#fm = minres -> minimum residual

fanal3 = fa(r = data, 3, rotate = "none", scores = "regression", fm = "ml", residuals = T)
fanal3
## Factor Analysis using method =  ml
## Call: fa(r = data, nfactors = 3, rotate = "none", scores = "regression", 
##     residuals = T, fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                          ML1   ML2   ML3      h2    u2 com
## city_development_index -0.34 -0.08  0.44 0.31489 0.685 2.0
## gender                 -0.01  0.01 -0.05 0.00238 0.998 1.2
## relevent_experience     0.13  0.76  0.11 0.61024 0.390 1.1
## enrolled_university    -0.12 -0.37  0.12 0.16585 0.834 1.5
## education_level        -0.08  0.08  0.12 0.02701 0.973 2.6
## major_discipline        0.01  0.03 -0.23 0.05204 0.948 1.0
## experience              0.07  0.17 -0.33 0.13986 0.860 1.6
## company_size            0.10  0.23  0.08 0.06948 0.931 1.7
## last_new_job            0.04  0.38 -0.15 0.16818 0.832 1.3
## training_hours         -0.02 -0.01 -0.02 0.00092 0.999 2.4
## target                  1.00  0.00  0.00 0.99500 0.005 1.0
## 
##                        ML1  ML2  ML3
## SS loadings           1.17 0.95 0.42
## Proportion Var        0.11 0.09 0.04
## Cumulative Var        0.11 0.19 0.23
## Proportion Explained  0.46 0.37 0.17
## Cumulative Proportion 0.46 0.83 1.00
## 
## Mean item complexity =  1.6
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  55  and the objective function was  0.49 with Chi Square of  9426.08
## The degrees of freedom for the model are 25  and the objective function was  0.02 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  19158 with the empirical chi square  635.46  with prob <  4.6e-118 
## The total number of observations was  19158  with Likelihood Chi Square =  381.76  with prob <  1.7e-65 
## 
## Tucker Lewis Index of factoring reliability =  0.916
## RMSEA index =  0.027  and the 90 % confidence intervals are  0.025 0.03
## BIC =  135.25
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy             
##                                                    ML1  ML2   ML3
## Correlation of (regression) scores with factors   1.00 0.81  0.60
## Multiple R square of scores with factors          1.00 0.66  0.36
## Minimum correlation of possible factor scores     0.99 0.32 -0.28
fanal4 = fa(r = data, 3, rotate = "none", scores = "regression", fm = "pa", residuals = T)
fanal4
## Factor Analysis using method =  pa
## Call: fa(r = data, nfactors = 3, rotate = "none", scores = "regression", 
##     residuals = T, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                          PA1   PA2   PA3     h2   u2 com
## city_development_index -0.44  0.36  0.23 0.3757 0.62 2.5
## gender                  0.02  0.00 -0.05 0.0032 1.00 1.2
## relevent_experience     0.58  0.48  0.12 0.5834 0.42 2.0
## enrolled_university    -0.37 -0.14  0.06 0.1626 0.84 1.3
## education_level        -0.04  0.16  0.04 0.0277 0.97 1.2
## major_discipline        0.10 -0.04 -0.20 0.0519 0.95 1.5
## experience              0.25 -0.03 -0.26 0.1322 0.87 2.0
## company_size            0.21  0.14  0.12 0.0785 0.92 2.3
## last_new_job            0.34  0.21 -0.15 0.1789 0.82 2.1
## training_hours         -0.02  0.00 -0.03 0.0013 1.00 1.6
## target                  0.55 -0.46  0.30 0.6067 0.39 2.5
## 
##                        PA1  PA2  PA3
## SS loadings           1.20 0.68 0.31
## Proportion Var        0.11 0.06 0.03
## Cumulative Var        0.11 0.17 0.20
## Proportion Explained  0.55 0.31 0.14
## Cumulative Proportion 0.55 0.86 1.00
## 
## Mean item complexity =  1.9
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  55  and the objective function was  0.49 with Chi Square of  9426.08
## The degrees of freedom for the model are 25  and the objective function was  0.02 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  19158 with the empirical chi square  644.28  with prob <  6.6e-120 
## The total number of observations was  19158  with Likelihood Chi Square =  407.51  with prob <  9e-71 
## 
## Tucker Lewis Index of factoring reliability =  0.91
## RMSEA index =  0.028  and the 90 % confidence intervals are  0.026 0.031
## BIC =  161
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy             
##                                                    PA1  PA2   PA3
## Correlation of (regression) scores with factors   0.83 0.77  0.58
## Multiple R square of scores with factors          0.69 0.59  0.33
## Minimum correlation of possible factor scores     0.39 0.18 -0.33
fanal5 = fa(r = data, 3, rotate = "none", scores = "regression", fm = "minres", residuals = T)
fanal5
## Factor Analysis using method =  minres
## Call: fa(r = data, nfactors = 3, rotate = "none", scores = "regression", 
##     residuals = T, fm = "minres")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                          MR1   MR2   MR3      h2     u2 com
## city_development_index -0.42  0.12  0.35 0.31648 0.6835 2.1
## gender                  0.01  0.01 -0.05 0.00250 0.9975 1.3
## relevent_experience     0.46  0.59  0.19 0.59740 0.4026 2.1
## enrolled_university    -0.31 -0.25  0.07 0.16369 0.8363 2.0
## education_level        -0.06  0.12  0.10 0.02737 0.9726 2.5
## major_discipline        0.08  0.02 -0.20 0.04793 0.9521 1.3
## experience              0.21  0.10 -0.29 0.14131 0.8587 2.1
## company_size            0.18  0.16  0.12 0.07398 0.9260 2.7
## last_new_job            0.26  0.30 -0.11 0.17081 0.8292 2.2
## training_hours         -0.02  0.00 -0.02 0.00095 0.9990 1.9
## target                  0.83 -0.51  0.21 0.99550 0.0045 1.8
## 
##                        MR1  MR2  MR3
## SS loadings           1.35 0.82 0.38
## Proportion Var        0.12 0.07 0.03
## Cumulative Var        0.12 0.20 0.23
## Proportion Explained  0.53 0.32 0.15
## Cumulative Proportion 0.53 0.85 1.00
## 
## Mean item complexity =  2
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  55  and the objective function was  0.49 with Chi Square of  9426.08
## The degrees of freedom for the model are 25  and the objective function was  0.02 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  19158 with the empirical chi square  623.7  with prob <  1.3e-115 
## The total number of observations was  19158  with Likelihood Chi Square =  387.54  with prob <  1.1e-66 
## 
## Tucker Lewis Index of factoring reliability =  0.915
## RMSEA index =  0.028  and the 90 % confidence intervals are  0.025 0.03
## BIC =  141.03
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy             
##                                                    MR1  MR2   MR3
## Correlation of (regression) scores with factors   0.94 0.86  0.62
## Multiple R square of scores with factors          0.88 0.74  0.39
## Minimum correlation of possible factor scores     0.76 0.48 -0.23
cbind(fanal3$loadings, fanal3$uniquenesses)
##                                 ML1          ML2          ML3            
## city_development_index -0.343199011 -0.080843032  0.436537369 0.685113968
## gender                 -0.006668923  0.012987000 -0.046550366 0.997619927
## relevent_experience     0.130161988  0.762202658  0.111086313 0.389764797
## enrolled_university    -0.121910020 -0.368153001  0.124303483 0.834149959
## education_level        -0.084114975  0.077108247  0.118276328 0.972989699
## major_discipline        0.013587365  0.030613175 -0.225654510 0.947958259
## experience              0.070105684  0.167858998 -0.326759640 0.860136688
## company_size            0.101791260  0.228601315  0.082821842 0.930520521
## last_new_job            0.044875222  0.378758785 -0.150702942 0.831816621
## training_hours         -0.021626403 -0.009556248 -0.018993591 0.999080220
## target                  0.997494132 -0.002014322  0.001193931 0.004999973
cbind(fanal4$loadings, fanal4$uniquenesses)
##                                PA1           PA2         PA3          
## city_development_index -0.44066277  0.3578821540  0.23120813 0.6242795
## gender                  0.01662688  0.0005612527 -0.05407884 0.9967987
## relevent_experience     0.57983945  0.4824826929  0.12001299 0.4165935
## enrolled_university    -0.37450520 -0.1363938701  0.06118041 0.8373995
## education_level        -0.04043568  0.1571573281  0.03696811 0.9722999
## major_discipline        0.09543141 -0.0448847006 -0.20194647 0.9480958
## experience              0.25179142 -0.0303237737 -0.26050323 0.8678196
## company_size            0.21406909  0.1350776827  0.11991156 0.9215497
## last_new_job            0.33797945  0.2074958194 -0.14709793 0.8210776
## training_hours         -0.01723893  0.0018227521 -0.03118792 0.9987268
## target                  0.54732911 -0.4638655534  0.30324810 0.3933002
cbind(fanal5$loadings, fanal5$uniquenesses)
##                                MR1          MR2         MR3            
## city_development_index -0.42051456  0.121989344  0.35321607 0.683524508
## gender                  0.01100262  0.013265017 -0.04689016 0.997504295
## relevent_experience     0.46406157  0.587818416  0.19108736 0.402601990
## enrolled_university    -0.31378576 -0.245361912  0.07086068 0.836314791
## education_level        -0.05877842  0.115014904  0.10339234 0.972626693
## major_discipline        0.07813839  0.021215458 -0.20341380 0.952067123
## experience              0.21436123  0.095003478 -0.29383233 0.858686162
## company_size            0.18348813  0.160843943  0.12018643 0.926016552
## last_new_job            0.26117817  0.302597940 -0.10504208 0.829186613
## training_hours         -0.01859798  0.001168926 -0.02464712 0.999045268
## target                  0.83467318 -0.506076241  0.20665411 0.004501593

Interpretation: Analysing these tables in detail, in all three methods the highest communality (h2), apart from target, belongs to relevent_experience, followed by city_development_index and last_new_job. A possible interpretation is that in highly developed cities it is more common for data scientists to obtain better conditions, which increases turnover (last_new_job) because the market that demands them is considerably more competitive than in a city with a lower development index (and fewer technological breakthroughs). Moreover, highly developed cities tend to offer more stable positions in which to gain experience in the field.
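The communality h2 in these tables is simply the sum of squared loadings across the retained factors, with the uniqueness u2 as the remainder. As a quick sanity check (assuming fanal3 from above is still in the workspace), we can recompute it by hand:

```r
# h2 = rowSums(loadings^2); for standardized variables h2 + u2 is approximately 1
L  <- unclass(fanal3$loadings)    # 11 x 3 matrix of loadings
h2 <- rowSums(L^2)                # communalities
u2 <- fanal3$uniquenesses         # uniquenesses
round(h2 + u2, 4)                 # all approximately 1
sort(h2, decreasing = TRUE)       # ranks variables by shared variance explained
```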

barplot(fanal3$loadings[,1], las=2, col="#85BAA1", ylim = c(-1, 1))

barplot(fanal3$loadings[,2], las=2, col="#F4E04D", ylim = c(-1, 1))

barplot(fanal3$loadings[,3], las=2, col="#D4CBE5", ylim = c(-1, 1))

barplot(fanal4$loadings[,1], las=2, col="#AD5D4E", ylim = c(-1, 1))

barplot(fanal4$loadings[,2], las=2, col="#40476D", ylim = c(-1, 1))

barplot(fanal4$loadings[,3], las=2, col="#826754", ylim = c(-1, 1))

barplot(fanal5$loadings[,1], las=2, col="#2589BD", ylim = c(-1, 1))

barplot(fanal5$loadings[,2], las=2, col="#38686A", ylim = c(-1, 1))

barplot(fanal5$loadings[,3], las=2, col="#B9F5D8", ylim = c(-1, 1))

As expected, each method explains the factors at a different level: with maximum likelihood the weights are fewer but more concentrated, so most of the shared variance can be captured with two factors, whereas with minimum residual this is not possible because each variable is only weakly represented. In my opinion, fanal3 and fanal4 lead to a better understanding of the data.

fa.diagram(fanal3); fa.diagram(fanal4); fa.diagram(fanal5)

Clustering

NOTE: This process is very time- and memory-consuming, and my computer cannot handle these procedures with such a large database. From now on I will comment out the heaviest steps so that this file can still be rendered.

Clustering is the process of grouping objects into classes of similar objects. We will start with K-means (on scaled data), whose algorithm is:

1. Select k centroids at random.
2. Assign each point to the nearest centroid.
3. Recalculate the centroids from the assigned classes.
4. Repeat steps 2 and 3 until the centroids no longer change.
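As an illustration of these four steps (not a replacement for kmeans(), which is far more efficient), the loop can be coded by hand on a small toy matrix:

```r
# Minimal hand-rolled k-means on toy data, following the four steps above
# (assumes no cluster becomes empty, which is fine for this small example)
set.seed(1)
X <- scale(matrix(rnorm(200 * 2), ncol = 2))       # toy standardized data
k <- 3
centers <- X[sample(nrow(X), k), , drop = FALSE]   # 1. random centroids
repeat {
  # 2. assign each point to its nearest centroid (Euclidean distance)
  d  <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
  cl <- max.col(-d)
  # 3. recompute each centroid as the mean of its assigned points
  new_centers <- apply(X, 2, function(col) tapply(col, cl, mean))
  # 4. stop when the centroids no longer move
  if (max(abs(new_centers - centers)) < 1e-8) break
  centers <- new_centers
}
table(cl)   # cluster sizes
```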

scaled_data = scale(data) 
k1 = kmeans(scaled_data, centers = 5, nstart = 100) 
k1_cluster = k1$cluster
k1_centers = k1$centers

# graphical representation
barplot(table(k1_cluster), col="#858AE3", xlab = "Cluster", ylab = "Observations")

Now we can plot the position of the cluster center against the global center (approximated below by the column medians of the scaled data):

# plotting centers in cluster 1 
bar1 = barplot(k1_centers[1,], las=2, col="#AAF683",ylim = c(-2,2),
main= paste("Cluster 1: center (green) and global center (pink)"))
points(bar1, y = apply(scaled_data, 2, quantile, 0.50),col="#E85D75", pch=19)

#plotting the clusters
fviz_cluster(k1, data = scaled_data, labelsize = 6) +
theme(text = element_text(size = 6))

Question: what is the optimal number of clusters? Let’s find out.

#fviz_nbclust(scaled_data, kmeans, method = 'wss')

Now we apply the elbow rule and we conclude the ideal number of clusters is 3.
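Since fviz_nbclust is too heavy for this dataset, the curve behind the elbow rule can also be computed directly with kmeans (a lighter sketch; nstart is reduced here to limit runtime, at the cost of some stability):

```r
# Total within-cluster sum of squares for k = 1..8: the curve used by the elbow rule
wss <- sapply(1:8, function(k) kmeans(scaled_data, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```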

#fviz_nbclust(scaled_data, kmeans, method = 'silhouette')
#fviz_nbclust(scaled_data, kmeans, method = 'gap_stat')

The silhouette method also points to three as the ideal number of clusters.

k2 = kmeans(scaled_data, centers = 3, nstart = 100)

k2_clusters = k2$cluster
k2_centers = k2$centers

bar2 = barplot(k2_centers[1,], las= 2, col="#214E34",ylim = c(-5,5), xlab = "Appropriate cluster", ylab = "appropriate observations")
  points(bar2 ,y = apply(scaled_data, 2, quantile, 0.5), col ="#C6CCB2", pch = 20 )

We have one well-defined cluster and a worse one, whose center deviates strongly from the global median.

b2 = barplot(k2_centers[3,], las = 2, col="#FFC15E",ylim = c(-2,2), xlab = "Inappropriate cluster", ylab = "inappropriate observations")
  points(b2, y = apply(scaled_data, 2, quantile, 0.5), col ="#BF4E30", pch = 20 )

Cluster plot

fviz_cluster(k2, data = scaled_data, repel = F, labelsize = 1, ellipse.type = "norm") + theme_light()

There are three groups, one of them lying between the first and the third, which are more clearly differentiated. That may be why gap_stat suggested a different number of clusters for our model.

Kmeans with Minkowski and Mahalanobis distances

k2_min = eclust(scaled_data, "kmeans", stand=T, k=3, graph = T, hc_metric = "minkowski")

Three clusters with an overlap; what if we try two clusters instead?

k2_min2 = eclust(scaled_data, "kmeans", stand=T, k=2, graph = T, hc_metric = "minkowski")

The overlap is smaller now. Let’s now try the Mahalanobis distance:

#AUTHOR NOTE: very time consuming
#S_x = cov(data)
#iS = solve(S_x)
#e = eigen(iS)
#V = e$vectors
#B = V %*% diag(sqrt(e$values)) %*% t(V)
#Xtil = scale(data,scale = FALSE)
#dataS = Xtil %*% B
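The whitening above maps Euclidean distance in the transformed space to Mahalanobis distance in the original one, so a plain kmeans() on dataS is a Mahalanobis k-means. This equivalence can be checked cheaply on a toy matrix, independent of the commented code:

```r
# Sanity check: squared Euclidean distance after whitening equals
# the squared Mahalanobis distance in the original space
set.seed(2)
Y  <- matrix(rnorm(100 * 3), ncol = 3)
S  <- cov(Y)
e  <- eigen(solve(S))
B  <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)   # S^(-1/2)
Yw <- scale(Y, scale = FALSE) %*% B                          # whitened data
d_mah <- mahalanobis(Y, colMeans(Y), S)   # squared Mahalanobis distances to the mean
d_euc <- rowSums(Yw^2)                    # squared Euclidean distances after whitening
all.equal(d_mah, d_euc)                   # TRUE (up to numerical error)
```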

Now we create our k-means models (we try with both 2 and 3 clusters to compare):

#k2_mah = kmeans(dataS, centers=3, nstart=100)
#k2_mah2 = kmeans(dataS, centers=2, nstart=100)
#graphically:
#fviz_cluster(k2_mah, data = scaled_data, stand = T, ellipse =  T, labelsize = 1)
#fviz_cluster(k2_mah2, data = scaled_data, stand = T, ellipse =  T, labelsize = 1)

How much do the two partitions agree? (answer: very little)

#adjustedRandIndex(k2_mah$cluster, k2_mah2$cluster) 
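The adjusted Rand index compares two labelings while ignoring the label names themselves: 1 means identical partitions, values near 0 mean agreement no better than chance. A toy sketch of how to read it (assuming mclust is loaded, as in the setup):

```r
# ARI = 1 for identical partitions, even with permuted labels; ~0 for unrelated ones
a <- c(1, 1, 1, 2, 2, 3, 3, 3)
b <- c(2, 2, 2, 3, 3, 1, 1, 1)    # same partition as a, only relabelled
adjustedRandIndex(a, b)            # 1
set.seed(3)
adjustedRandIndex(a, sample(a))    # close to 0: a random relabelling of the points
```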

PAM method

Let’s try other approaches such as “PAM” or hierarchical clustering:

Optimal number of clusters?

#AUTHOR NOTE: very time consuming
#fviz_nbclust(scaled_data, pam, method = 'wss', k.max = 20)
#AUTHOR NOTE: very time consuming
#fviz_nbclust(scaled_data, pam, method = 'silhouette', k.max = 20)

By the elbow method, 9 or 10; by the silhouette method, around 16.

#k3 = eclust(scaled_data, "pam", stand=T, k=16, graph = F)
#fviz_cluster(k3, data = scaled_data, stand = T, ellipse =  T, labelsize = 1)
#silhouette plot for more info. 
#fviz_silhouette(k3)

Let’s see if the Minkowski distance helps avoid the overlapping:

#k3_min1 = eclust(scaled_data, "pam", stand=T, k=16, graph = T, hc_metric = "minkowski")
#differences in clusters changing the distance method
#adjustedRandIndex(k3_min1$cluster, k3$cluster) 

We confirm that the method chosen to compute distances matters.

Hierarchical clustering

Matrices of distances:

#matrix_d1 = dist(data, method = "euclidean")

#matrix_d2 = dist(data, method = "minkowski", p = 2)
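The full distance matrix (19158 × 19158) is what makes this step infeasible on modest hardware. One practical workaround, sketched below, is to run the same pipeline on a random subsample (the subsample size of 1000 is an arbitrary choice):

```r
# Workaround for the memory cost of dist() on ~19k rows: cluster a random subsample
set.seed(4)
idx   <- sample(nrow(scaled_data), 1000)              # arbitrary subsample size
d_sub <- dist(scaled_data[idx, ], method = "euclidean")
hc    <- hclust(d_sub, method = "ward.D2")
plot(hc, labels = FALSE, hang = -1)                   # dendrogram of the subsample
table(cutree(hc, k = 3))                              # sizes of 3 cut clusters
```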

Optimal no. of clusters?

#n1=NbClust(data, distance = "euclidean", min.nc=2, max.nc=8, 
#             method = "complete", index = "all")

The best number is 2. We try hierarchical clustering with complete and ward.D2.

#k4_1 = hclust(matrix_d1, method = "complete")
#colors = color_branches(as.dendrogram(k4_1), k  =2)
#plot(colors)

k = 3 is a very reliable option.

#cluster1 = cutree(k4_1, k = 2)
#table(cluster1)

#difference in clusters' size
#k4_2 = hclust(matrix_d1, method = "ward.D2")
#colors = color_branches(as.dendrogram(k4_2), k  =3)
#plot(colors)
#cluster2 = cutree(k4_2, k = 3)
#table(cluster2)
#k4_2_v1 = hclust(matrix_d1, method = "ward.D2")
#colors = color_branches(as.dendrogram(k4_2_v1), k = 6)
#plot(colors)
#cluster2_v1 = cutree(k4_2_v1, k = 6)
#table(cluster2_v1)

There are two big, distinct clusters and then many small ones. k = 3 would be a good option, assuming the small clusters together form one big cluster.

Trying “complete” method:

#k4_3 = hclust(matrix_d2, method = "complete")
#colors = color_branches(as.dendrogram(k4_3), k = 2)
#plot(colors)
#cluster3 = cutree(k4_3, k = 2)
#table(cluster3)

Trying the “ward.D2” method:

#k4_4 = hclust(matrix_d2, method = "ward.D2")
#colors = color_branches(as.dendrogram(k4_4), k=3)
#plot(colors)
#cluster4 = cutree(k4_4, k = 3)
#table(cluster4)

Intersection of both clusters:

#adjustedRandIndex(cluster2, cluster4) 

#they are the same tree
#k4_5 = hclust(matrix_d2, method = "ward.D2")
#colors = color_branches(as.dendrogram(k4_5), k=6)
#plot(colors)
#cluster5 = cutree(k4_5, k = 6)
#table(cluster5)

Again, 3 or 4 clusters would be a good fit, avoiding both clusters that mix too much information and clusters with too little.

Conclusion

This task confirms that the study of a database with the techniques practised above is as variable and subjective as data science itself. Even with that advantage and drawback, the amount of information we can extract thanks to unsupervised learning is huge, and this analysis is one example of it.

Sources